Abstract. We consider a bandit problem over a graph where the rewards are
not directly observed. Instead, the decision maker can compare two nodes and
receive (stochastic) information pertaining to the difference in their value.
The graph structure describes the set of possible comparisons. Consequently,
comparing two nodes that are relatively far apart requires estimating the
difference between every pair of adjacent nodes on the path between them. We
analyze this problem from the perspective of sample complexity: How many
queries are needed to find an approximately optimal node with probability more
than $1 - \delta$ in the PAC setup? We show that the topology of the graph plays
a crucial role in defining the sample complexity: graphs with a low diameter
have a much better sample complexity.

1 Introduction 

We consider a graph where every edge can be sampled. When sampling an edge, 
the decision maker obtains a signal that is related to the value of the nodes 
defining the edge. The objective of the decision maker is to locate the node with 
the highest value. Since there is no possibility to sample the value of the nodes 
directly, the decision maker has to infer which is the best node by considering 
the differences between the nodes. 

As a motivation, consider the setup where a user interacts with a webpage. 
In the webpage, several links or ads can be presented, and the response of the 
user is to click one or none of them. Essentially, in this setup we query the 
user to compare between the different alternatives. The response of the user is 
comparative: a preference of one alternative to the other will be reflected in a 
higher probability of choosing the alternative. It is much less likely to obtain 
direct feedback from a user, asking her to provide an evaluation of the worth 
of the selected alternative. In such a setup, not all pairs of alternatives can be 
directly compared or, even if so, there might be constraints on the number of 
times a pair of ads can be presented to a user. For example, in the context of 
ads it is reasonable to require that ads for similar items will not appear in the 
same page (e.g., two competing brands of luxury cars will not appear on the same 
page). In these contexts, a click on a particular link cannot be seen as an absolute 
relevance judgement (e.g., [13]), but rather as a relative preference. Moreover, 
feedback can be noisy and/or inconsistent, hence aggregating the choices into
a coherent picture may be a non-trivial task. Finally, in such contexts pairwise 
comparisons occur more frequently than multiple comparisons, and are also more 
natural from a cognitive point of view (e.g., [52]). 

We model this learning scenario as bandits on graphs where the information 
that is obtained is differential. We assume that there is an inherent and unknown 
value per node, and that the graph describes the allowed (pairwise) comparisons. 
That is, nodes i and j are connected by an edge if they can be compared by a 
single query. In this case, the query returns a random variable whose distribution 
depends, in general, on the values of i and j. For the sake of simplicity, we 
assume that the observation of the edge between nodes i and j is a random 
variable that depends only on the difference between the values of i and j. Since 
this assumption is restrictive in terms of applicability of the algorithms, we also 
consider the more general setup where contextual information is observed before 
sampling the edges. This is intended to model a more practical setting where, 
say, a web system has preliminary access to a set of user profile features. 

In this paper, our goal is to identify the node with the highest value, a 
problem that has been studied extensively in the machine learning literature 
(e.g., |10ll| ). More formally, our objective is to find an approximately optimal 
node (i.e., a node whose value is at most $\epsilon$ smaller than the highest value) with a
given failure probability $\delta$ as quickly as possible. When contextual information is
added, the goal becomes to progressively shorten the time needed for identifying
a good node for the given user at hand, as more and more users interact with
the system.

Related work. There are two common objectives in stochastic bandit prob- 
lems: minimizing the regret and identifying the "best" arm. While both ob- 
jectives are of interest, regret minimization seems particularly difficult in our 
setup. In fact, a recent line of research related to our paper is the Dueling
Bandits Problem of Yue et al. [24, 25] (see also [11]). In the dueling bandit setting,
the learner has at its disposal a complete graph of comparisons between pairs
of nodes, and each edge (i, j) hosts an unknown preference probability $P_{ij}$, to
be interpreted as the probability that node i will be preferred over node j.
Further consistency assumptions (stochastic transitivity and stochastic triangle 
inequality) are added. The complete graph assumption allows the authors to
define a well-founded notion of regret, and analyze a regret minimization
algorithm, which is further enhanced in [25], where the consistency assumptions are
relaxed. Although at first glance our paper seems to deal with the same setup, we
highlight here the main differences. First, the setups differ in the
topology handled. In [24, 25] the topology is always a complete graph,
which makes it possible to directly compare any two nodes.
In our work (as in real life) the topology is not a complete graph, resulting in
a framework where comparing two nodes requires sampling all the edges
on a path between them. In the extreme case of a line graph we need to sample all
the edges in the graph in order to compare the two nodes that are farthest
apart. Second, the objective of minimizing the regret is natural for a complete
graph, where it amounts to comparing a choice of the best bandit repeatedly
with the actual pairs chosen. In a topology other than the complete graph this
notion is less clear, since one has to restrict choices to edges that are available.
Finally, the algorithms in [24, 25] are geared towards the elimination of arms
that are not optimal with high probability. In our setup one cannot eliminate
such nodes and edges, because they are crucial for comparing candidate optimal
nodes. Therefore, the resulting algorithms and analyses are quite different. On
the other hand, constraining to a given set of allowed comparisons leads us to
make less general statistical assumptions than [24, 25], in that our algorithms are
based on the ability to reconstruct the reward difference on adjacent nodes by
observing their connecting edge.

From a different perspective, the setup we consider is reminiscent of online
learning with partial monitoring [19]. In the partial monitoring setup, one
usually does not observe the reward directly, but rather a signal that is related
(probabilistically) to the unobserved reward. However, as far as we know, the
alternatives (usually called arms) in the partial monitoring setup are separate and
there is no additional structure: when sampling an arm, a reward that is related
to this arm alone is obtained but not observed. Our work differs in imposing
additional structure: the signal is always relative to a pair of adjacent nodes.
This means that comparing two nodes that are not adjacent requires sampling all
the edges on a path between the two nodes. Hence, deciding which of two remote
nodes has higher value requires a high degree of certainty regarding all the
comparisons on the path between them.

Another research area which is somewhat related to this paper is learning to 
rank via comparisons (a very partial list of references includes [7 13 8 4 14 5 12 2 3iiT5] ). 
Roughly speaking, in this problem we have a collection of training instances to 
be associated with a finite set of possible alternatives or classes (the graph nodes 
in our setting). Every training example is assigned a set of (possibly noisy or 
inconsistent) pairwise (or groupwise) preferences between the classes. The goal 
is to learn a function that maps a new training example to a total order (or rank- 
ing) of the classes. We emphasize that the goal in this paper is different in that 
we work in the bandit setup with a given structure for the comparisons and, in 
addition, we are just aiming at identifying the (approximately) best class, rather 
than ranking them all. 

Content of the paper. The rest of the paper is organized as follows. We start
from the formal model in Section 2. We analyze the basic linear setup, where
each node is comparable to at most two nodes, in Section 3. We then move to the
tree setup and analyze it in Section 4. The general setup of a network is treated
in Section 5. Some experiments are then presented in Section 6 to elucidate
the theoretical findings of the previous sections. In Section 7 we discuss the more
general setting with contextual information. We close with some directions for
future research.

2 Model and Preliminaries

In this section we describe the classical Multi-Armed Bandit (MAB) setup,
describe the Graphical Bandit (GB) setup, state/recall two concentration bounds
for sequences of random variables, and review a few terms from graph theory.

2.1 The Multi-Armed Bandit Problem

The MAB model [HI is comprised of a set of arms $A = \{1, \dots, n\}$. When
sampling arm $i \in A$, a reward, which is a random variable $R_i$, is provided. Let
$r_i = \mathbb{E}[R_i]$. The goal in the MAB setup is to find the arm with the highest
expected reward, denoted by $r^*$, where we term this arm's reward the optimal
reward. An arm whose expected reward is strictly less than $r^*$ is called a non-best
arm. An arm i is called an $\epsilon$-optimal arm if its expected reward is at most $\epsilon$ from
the optimal reward, i.e., $\mathbb{E}[R_i] \ge r^* - \epsilon$. In some cases, the goal in the MAB
setup is to find an $\epsilon$-optimal arm.

A typical algorithm for the MAB problem does the following. At each time
step t it samples an arm $i_t$ and receives a reward $R_{i_t}$. When making its selection,
the algorithm may depend on the history (i.e., the actions and rewards) up to
time $t - 1$. Eventually the algorithm must commit to a single arm and select it.
Next we define the desired properties of such an algorithm.

Definition 1. (PAC-MAB) An algorithm is an $(\epsilon, \delta)$-probably approximately
correct (or $(\epsilon, \delta)$-PAC) algorithm for the MAB problem with sample complexity
T if it terminates and outputs an $\epsilon$-optimal arm with probability at least
$1 - \delta$, and the number of times it samples arms before termination is bounded by
T.

In the case of standard MAB problems there is no structure defined over the 
arms. In the next section we describe the setup of our work where such a structure 
exists. 

2.2 The Graphical Bandit Problem 

Suppose that we have an undirected and connected graph $G = (V, E)$ with
nodes $V = \{1, \dots, n\}$ and edges $E$. The nodes are associated with reward values
$r_1, \dots, r_n$, respectively, that are unknown to us. We denote the node with highest
value by $i^*$ and, as before, $r^* = r_{i^*}$. Define $u = \min_{j \ne i^*} (r_{i^*} - r_j)$ to be the
difference between the node with the highest value and the node with the second
highest value. We call $u$ the reward gap, and interpret it as a measure of how
easy it is to discriminate between the two best nodes in the network. As expected,
the gap $u$ has a significant influence on the sample complexity bounds (provided
the accuracy parameter $\epsilon$ is not large). We say that nodes i and j are neighbors if
there is an edge in $E$ connecting them (and denote the edge random variable by
$E^{ij}$). This edge value is a random variable whose distribution is determined by
the nodes it is connecting, i.e., $(i, j)$'s statistics are determined by $r_i$ and $r_j$. In
this work, we assume that$^1$ $\mathbb{E}[E^{ij}] = r_j - r_i$. Also, for the sake of concreteness,
we assume the edge values are bounded in $[-1, 1]$.

In this model, we can only sample the graph edges $E^{ij}$, which provide
independent realizations of the node differences. For instance, we may interpret
$E^{ij} = +1$ if the feedback we receive says that item j is preferred over item i,
$E^{ij} = -1$ if i is preferred over j, and $E^{ij} = 0$ if no feedback is received. Then the
reward difference $r_j - r_i$ becomes equal to the difference between the probability
of preferring j over i and the probability of preferring i over j. Let us denote
the realizations of $E^{ij}$ by $E^{ij}_t$, where the subscript t denotes time. Our goal is to
find an $\epsilon$-optimal node, i.e., a node i whose reward satisfies $r_i \ge r^* - \epsilon$.
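
To make the three-valued feedback interpretation concrete, the following minimal
sketch simulates one edge pull. It is only an illustration under stated
assumptions: the function name `sample_edge` and the parameter `p_feedback` (the
probability that the user gives any feedback at all) are hypothetical and do not
come from the model above. The only property taken from the text is that the
expected edge value equals $r_j - r_i$.

```python
import random

def sample_edge(r_i, r_j, p_feedback, rng):
    """Draw one realization of the edge (i, j) in {-1, 0, +1} with mean r_j - r_i.

    Assumed feedback model (hypothetical): the user responds with probability
    p_feedback, and P(+1) - P(-1) = r_j - r_i.
    """
    d = r_j - r_i
    if not 0.0 <= p_feedback <= 1.0 or abs(d) > p_feedback:
        raise ValueError("need |r_j - r_i| <= p_feedback <= 1")
    u = rng.random()
    if u < (p_feedback + d) / 2.0:
        return 1    # item j preferred over item i
    if u < p_feedback:
        return -1   # item i preferred over item j
    return 0        # no feedback received

# Usage: the empirical mean of many pulls approaches r_j - r_i.
rng = random.Random(0)
samples = [sample_edge(0.2, 0.5, 0.6, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
```

Averaging such pulls is what the edge estimators in the following sections rely
on; any other bounded distribution with the same mean would serve equally well.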

Whereas neighboring nodes can be directly compared by sampling their connecting
edge, if the nodes are far apart, a comparison between the two can only
be done indirectly, by following a path connecting them. We denote a path
between node i and node j by $\pi_{ij}$. Observe that there can be several paths in G
connecting i to j. For a given path $\pi$ from i to j we define the composed edge
value $E^{ij}_\pi$ by $E^{ij}_\pi = \sum_{(k,l) \in \pi} E^{kl}$, with $E^{ii}_\pi = 0$. By telescoping, the average
value of a composed edge $E^{ij}_\pi$ only depends on its endpoints, i.e.,

$$\mathbb{E}[E^{ij}_\pi] = \sum_{(k,l) \in \pi} \mathbb{E}[E^{kl}] = \sum_{(k,l) \in \pi} (r_l - r_k) = r_j - r_i, \qquad (1)$$

independent of $\pi$. Similarly, define $E^{ij}_{\pi,t}$ to be the time-t realization of the
composed edge random variable $E^{ij}_\pi$ when we pull once all the edges along the path
$\pi$ joining i to j. A schematic illustration of the GB setup is presented in
Figure 1.
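
The telescoping property can be checked numerically. The sketch below is an
illustration only: the helper name `composed_mean`, the node labels, and the
reward values are made up for the example. It sums the expected edge values
$r_l - r_k$ along two different paths with the same endpoints and confirms that
both sums equal the difference of the endpoint rewards.

```python
def composed_mean(path, r):
    """Expected composed edge value along `path`, given as a list of nodes.

    Each edge (k, l) on the path contributes its mean r[l] - r[k].
    """
    return sum(r[l] - r[k] for k, l in zip(path, path[1:]))

# Hypothetical rewards on a 5-node graph containing a cycle.
r = {1: 0.2, 2: 0.7, 3: 0.4, 4: 0.9, 5: 0.1}
short_path = [1, 2, 4]
long_path = [1, 3, 5, 4]

mean_short = composed_mean(short_path, r)
mean_long = composed_mean(long_path, r)
```

Both values agree with `r[4] - r[1]` up to floating-point rounding. Only the
means coincide: the variance of an empirical composed edge grows with the number
of edges on the path, which is why short paths are preferable.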




Fig. 1. Schematic illustration of the GB setup for 6 nodes.



The algorithms we present in the next sections hinge on constructing reliable 
estimates of edge reward differences, and then combining them into a suitable 
node selection procedure. This procedure heavily depends on the graph topology. 
In a tree-like (i.e., acyclic) structure no inconsistencies can arise due to the 
noise in the edge estimators. Hence the node selection procedure just aims at 
identifying the node with the largest reward gap to a given reference node. On
the other hand, if the graph has cycles, we have to rely on a more robust node
elimination procedure, akin to the one investigated in [10] (see also the more
recent [1]).

$^1$ Notice that, although the graph is undirected, we view edge (i, j) as a directed edge
from i to j. It is understood that $E^{ij} = -E^{ji}$.

2.3 Large Deviations Inequalities 

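In this work we use Hoeffding's maximal inequality (e.g., [6]).

Lemma 1. Let $X_1, \dots, X_N$ be independent random variables with zero mean
satisfying $a_i \le X_i \le b_i$ w.p. 1. Let $S_i = \sum_{j=1}^{i} X_j$. Then,

$$P\left(\max_{1 \le i \le N} S_i > \epsilon\right) \le \exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^{N} (b_i - a_i)^2}\right).$$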
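
As a sanity check, the sketch below compares the standard form of Hoeffding's
maximal inequality, $\exp(-2\epsilon^2 / \sum_i (b_i - a_i)^2)$, with a Monte Carlo
estimate for a symmetric $\pm 1$ random walk. The function names, the walk
length, the threshold, and the number of trials are all choices made for this
illustration and are not taken from the paper.

```python
import math
import random

def hoeffding_max_bound(eps, widths):
    """Bound exp(-2 eps^2 / sum of squared range widths (b_i - a_i))."""
    return math.exp(-2.0 * eps ** 2 / sum(w ** 2 for w in widths))

def max_partial_sum_exceeds(n_steps, eps, rng):
    """Simulate a +/-1 walk; report whether any partial sum exceeds eps."""
    total = 0
    for _ in range(n_steps):
        total += 1 if rng.random() < 0.5 else -1
        if total > eps:
            return True
    return False

rng = random.Random(0)
n_steps, eps, trials = 50, 20, 2000
# Each step lies in [-1, 1], so every range width b_i - a_i equals 2.
bound = hoeffding_max_bound(eps, [2.0] * n_steps)
freq = sum(max_partial_sum_exceeds(n_steps, eps, rng) for _ in range(trials)) / trials
```

Here `bound` equals $e^{-4}$, and the empirical frequency `freq` should fall well
below it, since the inequality is not tight for this walk.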

3 Linear topology and sample complexity 

As a warm-up, we start by considering the GB setup in the case of a linear graph,
i.e., $E = \{(i, i+1) : 1 \le i \le n-1\}$. We call it the linear setup. The algorithm
for finding the highest node in the linear setup is presented in Algorithm 1. The
algorithm samples all the edges, computes for each edge its empirical mean, and
based on these statistics finds the highest node. Algorithm 1 will also serve as a
subroutine for the tree topology discussed in Section 4. The following proposition



Algorithm 1 The algorithm for the linear setup

Input: $\epsilon > 0$, $\delta > 0$, line graph with edge set $E = \{(i, i+1) : 1 \le i \le n-1\}$
for i = 1, ..., n - 1 do
    Pull edge $(i, i+1)$ for $T^i$ times
    Let $\hat{E}^{i,i+1} = \frac{1}{T^i} \sum_{t=1}^{T^i} E^{i,i+1}_t$ be the empirical average of edge $(i, i+1)$
    Let $\hat{E}^{1,i+1}_{\pi_{1,i+1}} = \sum_{k=1}^{i} \hat{E}^{k,k+1}$ be the empirical average of the composed edge
        $E^{1,i+1}_{\pi_{1,i+1}}$, where $\pi_{1,i+1}$ is the (unique) path from 1 to i + 1
end for
Output: Node $k = \arg\max_{i \in V} \hat{E}^{1,i}_{\pi_{1i}}$, with the convention $\hat{E}^{1,1}_{\pi_{11}} = 0$



gives the sample complexity of Algorithm 1 in the case when the edges are
bounded.

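Proposition 1. If $-1 \le E^{i,i+1} \le 1$ holds, then Algorithm 1 operating on a
linear graph with reward gap $u$ is an $(\epsilon, \delta)$-PAC algorithm when the $T^i$ satisfy

$$\sum_{i=1}^{n-1} \frac{1}{T^i} \le \frac{\max\{\epsilon, u\}^2}{4 \log(2/\delta)}.$$

If $T^i \equiv T$ then the sample complexity of each edge is $T \ge \frac{4n}{\max\{\epsilon, u\}^2} \log(2/\delta)$.
Hence the sample complexity of the algorithm is $O\left(\frac{n^2}{\max\{\epsilon, u\}^2} \log(1/\delta)\right)$.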
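Proof. Let $\tilde{E}_t^{i,i+1} = \frac{E_t^{i,i+1} - (r_{i+1} - r_i)}{T^i}$, $t = 1, \dots, T^i$. Each $\tilde{E}_t^{i,i+1}$ has zero mean
with $-2/T^i \le \tilde{E}_t^{i,i+1} \le 2/T^i$, and its range has width $2/T^i$ since $E_t^{i,i+1} \in [-1, 1]$. Hence

$$\tilde{E}_1^{1,2}, \dots, \tilde{E}_{T^1}^{1,2}, \; \tilde{E}_1^{2,3}, \dots, \tilde{E}_{T^2}^{2,3}, \; \dots, \; \tilde{E}_1^{n-1,n}, \dots, \tilde{E}_{T^{n-1}}^{n-1,n} \qquad (2)$$

is a sequence of $\sum_{i=1}^{n-1} T^i$ zero-mean and independent random variables. Set for
brevity $\tilde{\epsilon} = \max\{\epsilon, u\}$, and suppose, without loss of generality, that some node
j has the highest value. The probability that Algorithm 1 fails, i.e., returns a
node whose value is more than $\tilde{\epsilon}$ below the optimal value, is bounded by

$$\Pr\left(\exists i \in V : \hat{E}^{1,i} > \hat{E}^{1,j} \text{ and } r_i < r_j - \tilde{\epsilon}\right) \qquad (3)$$
$$\le \Pr\left(\exists \text{ partial sum in (2) with magnitude} > \tilde{\epsilon}\right)$$
$$\le 2 \exp\left(-\frac{\tilde{\epsilon}^2}{\sum_{i=1}^{n-1} \sum_{t=1}^{T^i} (2/T^i)^2}\right),$$

where in the last inequality we used Lemma 1. Requiring this probability to be
bounded by $\delta$ yields the claimed inequality. □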

The sample sizes $T^i$ in Proposition 1 encode constraints on the number of
times the edges $(i, i+1)$ can be sampled. Notice that the statement therein
implies $T^i \ge \frac{4}{\max\{\epsilon, u\}^2} \log(2/\delta)$ for all i, i.e., we cannot afford in a line graph to
undersample any edge. This is because every edge in a line graph is a bridge,
hence a poor estimation of any such edge would affect the differential reward
estimation throughout the graph. In this respect, this proposition only allows
for a partial tradeoff among these numbers.
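
The procedure for the linear setup can be sketched in a few lines. This is a
minimal illustration and not a reference implementation: nodes are indexed from
0, the function names are hypothetical, the $\pm 1$ feedback model in `make_pull`
is an assumption, and the constant `c = 4` in the per-edge sample size
$c \, n / \max\{\epsilon, u\}^2 \cdot \log(2/\delta)$ is likewise an assumption of
this sketch.

```python
import math
import random

def equal_sample_size(n, eps, gap, delta, c=4.0):
    """Per-edge sample size c * n / max(eps, gap)^2 * log(2 / delta).

    The constant c is an assumption of this sketch.
    """
    acc = max(eps, gap)
    return math.ceil(c * n / acc ** 2 * math.log(2.0 / delta))

def line_best_node(n, T, pull):
    """Select a node on the line 0 - 1 - ... - (n-1).

    pull(i, j) must return one noisy sample with mean r_j - r_i.
    """
    composed = [0.0]  # composed[i] estimates r_i - r_0 (node 0 is the reference)
    for i in range(n - 1):
        edge_avg = sum(pull(i, i + 1) for _ in range(T)) / T
        composed.append(composed[-1] + edge_avg)
    return max(range(n), key=composed.__getitem__)

def make_pull(rewards, rng):
    """Hypothetical +/-1 feedback model whose mean is r_j - r_i."""
    def pull(i, j):
        d = rewards[j] - rewards[i]
        return 1 if rng.random() < (1.0 + d) / 2.0 else -1
    return pull

rewards = [0.1, 0.5, 0.3, 0.9, 0.2]  # unknown to the algorithm
T = equal_sample_size(len(rewards), eps=0.1, gap=0.4, delta=0.05)
best = line_best_node(len(rewards), T, make_pull(rewards, random.Random(0)))
```

The running sum `composed` plays the role of the empirical composed edges from
the reference node, so the noise of every edge on the way to a node accumulates
in that node's estimate.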



4 Tree topology and its sample complexity 

In this section we investigate PAC algorithms for finding the best node in a tree.
Let $G = (V, E)$ be an n-node tree with diameter D and a set of leaves
$L \subseteq V$. Without loss of generality we can assume that the tree is rooted at node
1 and that all edges are directed downwards to the leaves. Algorithm 2 considers
all possible paths from the root to the leaves and treats each one of them as a
line graph to be processed as in Algorithm 1. We have the following proposition
where, for simplicity of presentation, we no longer differentiate among the
sample sizes $T^{ij}$ associated with edges (i, j).
Algorithm 2 The algorithm for the tree setup

Input: $\epsilon > 0$, $\delta > 0$, tree graph with set of leaves $L \subseteq V$
for all leaves $k \in L$ do
    Pull each edge $(i, j) \in \pi_{1k}$ for T times
    Let $\hat{E}^{ij} = \frac{1}{T} \sum_{t=1}^{T} E^{ij}_t$ be the empirical average of edge (i, j), and let
        $m_k = \arg\max_{i \in \pi_{1k}} \hat{E}^{1,i}_{\pi_{1k}}$ be the node with maximum empirical average
        along path $\pi_{1k}$ (as in Algorithm 1)
end for
Output: Node $m = m_{k^*}$, where $k^* = \arg\max_{k \in L} \hat{E}^{1,m_k}_{\pi_{1k}}$



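Proposition 2. If $-1 \le E^{ij} \le 1$ holds, then Algorithm 2 operating on a tree
graph with reward gap $u$ is an $(\epsilon, \delta)$-PAC algorithm when the sample complexity
T of each edge satisfies $T \ge \frac{4D}{\max\{\epsilon, u\}^2} \log\left(\frac{2|L|}{\delta}\right)$. Hence the sample complexity
of the algorithm is $O\left(\frac{nD}{\max\{\epsilon, u\}^2} \log\left(\frac{|L|}{\delta}\right)\right)$.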

Proof. The probability that Algorithm 2 returns a node whose average reward
is more than $\tilde{\epsilon}$ below the optimal one coincides with the probability that there
exists a leaf $k \in L$ such that Algorithm 1 operating on the linear graph $\pi_{1k}$
singles out a node $m_k$ whose average reward is more than $\tilde{\epsilon}$ from the optimal one
within $\pi_{1k}$. Setting $T = \frac{4D}{\tilde{\epsilon}^2} \log\left(\frac{2|L|}{\delta}\right)$, with $\tilde{\epsilon} = \max\{\epsilon, u\}$, ensures that,
for each leaf, the above happens with probability at most $\delta/|L|$. Hence each
edge is sampled at most $\frac{4D}{\tilde{\epsilon}^2} \log\left(\frac{2|L|}{\delta}\right)$ times and the claim follows by a standard
union bound over L.

□
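
For a rooted tree, the root-to-leaf processing can be sketched as a single
traversal. The code below is a simplified illustration under stated
assumptions: it estimates every tree edge once from T pulls and accumulates the
estimates from the root, rather than re-pulling shared edges for each leaf path.
The names `tree_best_node`, `children`, and `make_pull`, as well as the $\pm 1$
feedback model, are hypothetical.

```python
import random

def tree_best_node(children, root, T, pull):
    """Return the node with the largest estimated reward difference to the root.

    children maps a node to the list of its children; pull(i, j) returns one
    noisy sample with mean r_j - r_i.
    """
    estimate = {root: 0.0}  # estimate[v] approximates r_v - r_root
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            edge_avg = sum(pull(u, v) for _ in range(T)) / T
            estimate[v] = estimate[u] + edge_avg
            stack.append(v)
    return max(estimate, key=estimate.get)

def make_pull(rewards, rng):
    """Hypothetical +/-1 feedback model whose mean is r_j - r_i."""
    def pull(i, j):
        d = rewards[j] - rewards[i]
        return 1 if rng.random() < (1.0 + d) / 2.0 else -1
    return pull

rewards = {0: 0.2, 1: 0.5, 2: 0.1, 3: 0.9, 4: 0.4}  # unknown to the algorithm
children = {0: [1, 2], 1: [3, 4]}
best = tree_best_node(children, 0, 2000, make_pull(rewards, random.Random(1)))
```

Because a tree has no cycles, the accumulated estimates are automatically
consistent: each node receives exactly one value, determined by its unique path
to the root.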



5 Network Sample Complexity 

In this section we deal with the problem of finding the optimal reward in a general
connected and undirected graph $G = (V, E)$, with $|V| = n$. We describe a node
elimination algorithm that works in phases, sketch an efficient implementation,
and provide a sample complexity bound. The following ancillary definitions will be
useful. We say that a node is a local maximum in a graph if none of its neighboring
nodes has higher expected reward than the node itself. The distance
between node i and node j is the length of the shortest path between the two
nodes. Finally, the diameter $D(G)$ of a graph G is the largest distance between
any pair of nodes.
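
The three definitions translate directly into code. The following sketch is
illustrative only: it assumes an adjacency-list representation `adj` of a
connected graph and uses known node values, whereas the algorithm below only has
access to noisy edge estimates. All function names are hypothetical.

```python
from collections import deque

def distances_from(adj, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter(adj):
    """Largest shortest-path distance between any pair of nodes."""
    return max(max(distances_from(adj, s).values()) for s in adj)

def local_maxima(adj, value):
    """Nodes none of whose neighbors has a strictly higher value."""
    return [i for i in adj if all(value[j] <= value[i] for j in adj[i])]

# Usage on the path graph 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
value = {0: 0.3, 1: 0.1, 2: 0.5, 3: 0.2}
maxima = local_maxima(adj, value)
diam = diameter(adj)
```

On this example the local maxima are nodes 0 and 2 and the diameter is 3.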

Our suggested algorithm operates in log n phases. For notational simplicity,
it will be convenient to use subscripts to denote the phase number. We begin
with Phase 1, where the graph $G_1 = (V_1, E_1)$ is the original graph, i.e., at
the beginning all nodes are participating, and $n_1 = |V_1| = n$. We then find a
subgraph of $G_1$, which we call the sampled graph and denote by $G_1^S$, that includes
all the edges involved in shortest paths between all nodes in $V_1$. We sample each
edge in subgraph $G_1^S$ for $T_1$ times, and compute the corresponding sample averages.
Based on these averages, we find the local maxima$^2$ of $G_1^S$.

The key observation is that there can be at most $n_1/2$ maxima. Denote this
set of maxima by $V_2$. Now, define a subgraph, denoted by $G_2$, whose nodes
are $V_2$. We repeat the process of getting a sampled graph, denoted by $G_2^S$. We
sample the edges of the sampled graph $G_2^S$ for $T_2$ times and define, based on
its maxima, a new subgraph. Denote the set of maxima by $V_3$, and the process
continues until only one node is left. We call this algorithm NNE (Network
Node Elimination), which is similar to the action elimination procedure of [10]
(see also [1]). The algorithm is summarized in Algorithm 3. Two points should



Algorithm 3 The Network Node Elimination Algorithm

Input: $\epsilon > 0$, $\delta > 0$, graph $G = (V, E)$, i = 1
Initialize $G_1 = G$, $V_1 = V$
Compute the shortest path between all pairs of nodes of $G_1$, and denote each path
    by $\pi_{jk}$
Initialize the shortest path set by $SP_1 = \{\pi_{jk} \mid j, k \in V_1\}$
while $|V_i| > 1$ do
    $n_i = |V_i|$
    Using the shortest paths in $SP_i$, find a sampled graph $G_i^S$ of $G_i$
    $D_i = D(G_i^S)$
    Pull each edge in $G_i^S$ for $T_i$ times
    Find the local maxima set, $V_{i+1}$, on $G_i^S$, and get a subgraph $G_{i+1}$ that contains
        $V_{i+1}$
    $SP_{i+1} = \{\pi_{jk} \in SP_i \mid j, k \in V_{i+1}\}$
    $i \leftarrow i + 1$
end while
Output: The remaining node



be made regarding the NNE algorithm. First, as will be observed below, the
sequence $\{D(G_i^S)\}_i$ of diameters is nonincreasing. Second, from the
implementation viewpoint, a data structure maintaining all shortest paths between
nodes is crucial, in order to efficiently eliminate nodes while tracking the
shortest paths between the surviving nodes of the graph. In fact, this data structure
might just be a collection of n breadth-first spanning trees rooted at each node,
that encode the shortest path between the root and any other node in the graph.
When node i gets eliminated, we first eliminate the spanning tree rooted at i,
but also prune all the other spanning trees where node i occurs as a leaf. If i is
a (non-root) internal node of another tree, then i should not be eliminated from
this tree since i certainly belongs to the shortest path between another pair of
surviving nodes. Note that connectivity is maintained through the process.
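
One possible reading of the breadth-first data structure is sketched below. It
is an assumption-laden illustration, not the authors' implementation: for each
surviving node it builds a breadth-first tree over a connected graph and
collects the edges on the tree path to every other surviving node, which keeps
one shortest path per ordered pair. The names `bfs_parents` and `sampled_edges`
are hypothetical.

```python
from collections import deque

def bfs_parents(adj, root):
    """Parent pointers of a breadth-first spanning tree rooted at root."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def sampled_edges(adj, survivors):
    """Edges on one shortest path between each ordered pair of surviving nodes.

    Assumes adj describes a connected undirected graph.
    """
    edges = set()
    for s in survivors:
        parent = bfs_parents(adj, s)
        for t in survivors:
            v = t
            while parent[v] is not None:  # walk from t back to the root s
                edges.add(frozenset((v, parent[v])))
                v = parent[v]
    return edges

# Usage on a 6-cycle where only nodes 0 and 2 survive.
adj = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
edges = sampled_edges(adj, [0, 2])
```

In this example only the edges {0, 1} and {1, 2} need to be sampled; node 1
stays in the sampled graph as an internal path node even though it is no longer
a candidate.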

The following result gives a PAC bound for Algorithm 3 in the case when
the $E^{ij}$ are bounded.



$^2$ Ties can be broken arbitrarily.
Proposition 3. Suppose that $-1 \le E^{ij} \le 1$ for every $(i, j) \in E$. Then
Algorithm 3 operating on a general graph G with diameter D and reward gap $u$ is an
$(\epsilon, \delta)$-PAC algorithm with edge sample complexity

J. < ^'-1 — ^ log — < ^ log — — 

(niax{e, w}/logn) \(>/^ognJ (max{e,-u}/logn) V^/log 

Proof. In each phase we have at most half the nodes of the previous phase,
i.e., $n_{i+1} \le n_i/2$. Therefore, the algorithm stops after at most log n phases.
Also, because we retain shortest paths between the surviving nodes, we also have
$D_{i+1} \le D_i \le D$. At each phase, similar to the previous sections, we make
sure that the probability of failing to identify an $(\epsilon/\log n)$-optimal node is
at most $\delta/\log n$. Therefore, it suffices to pull the edges in each sampled
graph $G_i^S$ for

Ti < 7 1- — T-^T2 log I -rrP — I times. Hence the overall sample complexity 

' — (max{e,ti}/ log(n))^ '^yd/lognj -t^ f J 

for an (e, (5)-PAC bound is at most X^j'^'i" claimed. The last inequality just 

follows from $n_{i+1} \le n_i/2$ and $D_i \le D$ for all i. □

Being more general, the bound contained in Proposition 3 is weaker than the
ones in previous sections when specialized to line graphs or trees. In fact, one
is left wondering whether it is always convenient to reduce the identification
problem on a general graph G to the identification problem on trees by, say,
extracting a suitable spanning tree of G and then invoking Algorithm 2 on it.
The answer is actually negative, as the set of simulations reported in the next
section shows.



6 Simulations 

In this section we briefly investigate the role of the graph topology in the sample
complexity.

In our simple experiment we compare Algorithm 2 (with two types of spanning
trees) to Algorithm 3 over the "spider web graph" illustrated in Figure 2(a).
This graph is made up of 15 nodes arranged in 3 concentric circles (5 nodes
each), where the circles are connected so as to resemble a spider web. Node
rewards are independently generated from the uniform distribution on [0, 1]; edge
rewards are just uniform in [-1, +1]. The two mentioned spanning trees are the
longest diameter spanning tree (diameter 14) and the shortest diameter spanning
tree (diameter 5). As we see from Figure 2(b), the latter tends to outperform
the former. However, both spanning tree-based algorithms are eventually
outperformed by NNE on this graph. This is because in later stages NNE tends to
handle smaller subgraphs, hence it needs only compare subsets of "good nodes".



7 Extensions 



We now sketch an extension of our framework to the case when the algorithm
receives contextual information in the form of feature vectors before sampling
the edges. This is intended to model a more practical setting where, say, a web
system has preliminary access to a set of user profile features.

Fig. 2. (a) The spider-web topology. (b) Empirical error vs. time for the graph setup in
(a) and spanning trees thereof. Three algorithms are compared: NNE (red solid line),
the tree-based algorithm operating on a smallest diameter spanning tree (black dashed
line), and the tree-based algorithm operating on a largest diameter spanning tree (blue
dash-dot line). The parameters are n = 15 and $\epsilon = 0$. Average of 200 runs.

This extension is reminiscent of the so-called contextual bandit learning
setting (e.g., [17]), also called bandits with covariates (e.g., [21]). In such a setting,
it is reasonable to assume that different users $x_s$ have different preferences (i.e.,
different best nodes associated with them), but also that similar users tend to have
similar preferences. A simple learning model that accommodates the above (and
is also amenable to theoretical analysis) is to assume each node i of G to host a
linear function $u_i : x \mapsto u_i^\top x$ where, for simplicity, $\|u_i\| = \|x\| = 1$ for all i and
x. The optimal node $i^*(x)$ corresponding to vector x is $i^*(x) = \arg\max_{i \in V} u_i^\top x$.
Our goal is to identify, for the given x at hand, an $\epsilon$-optimal node j such that
$u_j^\top x \ge u_{i^*(x)}^\top x - \epsilon$. Again, we do not directly observe node rewards, but only the
differential rewards provided by edges.$^3$ When we operate on input x and pull
edge (i, j), we receive an independent observation of random variable $E^{ij}(x)$
such that $\mathbb{E}[E^{ij}(x)] = u_j^\top x - u_i^\top x$.
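
The linear model can be illustrated with a few helper functions. This is a
minimal sketch under stated assumptions: the node vectors `U` and the context
`x` are made-up unit vectors, and the function names are hypothetical. In the
actual problem the vectors $u_i$ are unknown and only noisy edge observations are
available.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def optimal_node(U, x):
    """Index of the node maximizing the linear reward u_i^T x."""
    return max(range(len(U)), key=lambda i: dot(U[i], x))

def expected_edge(U, i, j, x):
    """Mean of the edge (i, j) observation on context x: u_j^T x - u_i^T x."""
    return dot(U[j], x) - dot(U[i], x)

def is_eps_optimal(U, j, x, eps):
    """Check whether node j is within eps of the best node for context x."""
    return dot(U[j], x) >= dot(U[optimal_node(U, x)], x) - eps

# Hypothetical unit-norm node vectors in the plane and one unit-norm context.
U = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
x = (0.6, 0.8)
best = optimal_node(U, x)
```

For this context the best node is node 2, whose vector coincides with `x`. A
different context, for example `(1.0, 0.0)`, makes node 0 optimal, which is why
the identification has to be repeated for each new user.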

Learning proceeds in a sequence of stages s = 1, ..., S, each stage being in
turn a sequence of time steps corresponding to the edge pulls taking place in
that stage. In Stage 1 the algorithm gets input $x_1$, is allowed to pull (several
times) the graph edges $E^{ij}(x_1)$, and is required to output an $\epsilon$-optimal node for
$x_1$. Let $T(x_1)$ be the sample complexity of this stage. In Stage 2, we retain the
information gathered in Stage 1, receive a new vector $x_2$ (possibly close to $x_1$)
and repeat the same kind of inference, with sample complexity $T(x_2)$. The game
continues until S stages have been completed.



$^3$ For simplicity of presentation, we disregard the reward gap here.
For any given sequence $x_1, x_2, \dots, x_S$, one expects the cumulative sample
size $\sum_{s=1}^{S} T(x_s)$ to grow less than linearly in S. In other words, the additional
effort the algorithm makes in the identification problem diminishes with time,
as more and more users are interacting with the system, especially when these
users are similar to each other, or even occur more than once in the sequence $x_1$,
$x_2, \dots, x_S$. In fact, we can prove stronger results of the following kind. Notice
that the bound does not depend on the number S of stages, but only on the
dimension of the input space.$^4$

Proposition 4. Under the above assumptions, if $G = (V, E)$ is a connected and
undirected graph, with n nodes and diameter D, and $x_1, x_2, \dots, x_S \in \mathbb{R}^d$ is
any sequence of unit-norm feature vectors, then a version of the NNE algorithm
exists which, with probability at least $1 - \delta$, outputs at each stage s an
$\epsilon$-optimal node for $x_s$, and achieves the following cumulative sample size



Y,T{xs)^0{B\og^B) 



where B 



(£/ logri)^ o\^5/log, 

Proof (Sketch). The algorithm achieving this bound combines
linear-regression-like estimators with NNE. In particular, every edge (i, j) of G maintains a linear
estimator $\hat{u}^{ij}_{s,t}$ intended to approximate the difference $u_j - u_i$ over both stages
and sampling times within each stage. At stage s and sampling time t within
stage s, the vector $\hat{u}^{ij}_{s,t}$ suitably stores all past feature vectors $x_1, \dots, x_s$ observed
so far, along with the corresponding edge reward observations. By using tools
from ridge regression in adversarial settings (see, e.g., [9]), one can show
high-probability approximation results of the form

$$\left(\hat{u}^{ij\,\top}_{s,t} x - (u_j - u_i)^\top x\right)^2 \le x^\top A_{s,t}^{-1} x \left(d \log \Sigma_{s,t} + \log \frac{1}{\delta}\right), \qquad (4)$$

being $\Sigma_{s,t} = \sum_{k \le s-1} T(x_k) + t$, and $A_{s,t}$ the matrix

$$A_{s,t} = I + \sum_{k \le s-1} T(x_k) \, x_k x_k^\top + t \, x_s x_s^\top.$$

In stage s, NNE is able to output an $\epsilon$-optimal node for input $x_s$ as soon as the
RHS of (4) is as small as $c\epsilon^2$, for a suitable constant c depending on the current
graph topology NNE is operating on. Then the key observation is that in stage s
the number of times we sample an edge (i, j) such that the above is false cannot
be larger than

$$\frac{1}{c\epsilon^2} \log \frac{|A_{s,T(x_s)}|}{|A_{s,0}|} \left(d \log \Sigma_{s,T(x_s)} + \log \frac{1}{\delta}\right),$$



$^4$ A slightly different statement holds in the case when the input dimension is infinite.
This statement quantifies the cumulative sample size w.r.t. the amount to which the
vectors $x_1, x_2, \dots, x_S$ are close to each other. Details are omitted due to lack of
space.
where $|\cdot|$ is the determinant of the matrix at argument. This follows from
standard inequalities of the form $\sum_{t=1}^{T(x_s)} x_s^\top A_{s,t}^{-1} x_s \le \log \frac{|A_{s,T(x_s)}|}{|A_{s,0}|}$. □

8 Discussion 

This paper falls in the research thread of analyzing online decision problems
where the information that is obtained is comparative between arms. We
analyzed a simple setup where the structure of comparisons is provided by a given
graph, which, unlike previous works on this subject [24, 25], led us to focus on the
notion of finding an $\epsilon$-optimal arm with high probability. We then described an
extension to the important contextual setup. There are several issues that call
for further research that we outline below.

First, we only addressed the exploratory bandit problem. It would be
interesting to consider the regret minimization version of the problem. While naively
one can think of it as a problem with an arm per edge of the graph, this may
not be a very effective model because the number of arms may go as $n^2$ but the
number of parameters grows like n. On top of this, defining a meaningful
notion of regret may not be trivial (see the discussion in the introductory section).
Second, we only considered graphs as opposed to hypergraphs. Considering
comparisons of more than two nodes raises interesting modeling issues as well as
computational issues. Third, we assumed that all samples are equivalent in the
sense that all the pairs we can compare have the same cost. This is not a realistic
assumption in many applications. An approach akin to budgeted learning [20]
would be interesting here. Fourth, we focused on upper bounds and
constructive algorithms. Obtaining lower bounds that depend on the network topology
would be interesting. The upper bounds we have provided are certainly loose
for the case of a general network. Furthermore, more refined upper bounds are
likely to exist which take into account the distance on the graph between the
good nodes (e.g., between the best and the second best ones). In any event, the
algorithms we developed for the network case are certainly not optimal. There
is room for improvement by reusing information better and by adaptively
selecting which portions of the network to focus on. This is especially interesting
under smoothness assumptions on the expected rewards. Relevant references in 
the MAB setting to start off with include |2ll5l3j . 